Here we look at using InferCNV
to identify tumor and normal cells in SCPCS000491.
Before running this notebook, InferCNV is run using the
run-infercnv.R script. This script runs
InferCNV once with Endothelial cells as the reference cells
and once without any cells as the reference.
Here, we look the copy number variations found reported by
InferCNV and use that copy number information to call cells
as tumor or normal cells. We then compare these classifications to
manual classifications and annotations from SingleR and
CellAssign.
suppressPackageStartupMessages({
# load required packages
library(SingleCellExperiment)
library(ggplot2)
library(dplyr)
})
## Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
## 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)
# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")
sample_dir <- file.path(data_dir, "SCPCP000015", params$sample_id)
# The path to this module
module_base <- file.path(repository_base, "analyses", "cell-type-ewings")
# source in helper functions
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
# Input files
sce_filename <- glue::glue("{params$library_id}_processed.rds")
sce_file <- file.path(sample_dir, sce_filename)
# png ref files
png_file <-file.path(params$infercnv_dir, glue::glue("{params$library_id}_infercnv.png"))
# with ref cnv metadata file
cnv_metadata_file <- file.path(params$infercnv_dir, glue::glue("{params$library_id}_cnv-metadata.tsv"))
# full infercnv object with information on genes/ chr to use later
final_cnv_obj_file <- file.path(params$infercnv_dir, glue::glue("{params$library_id}_cnv-obj.rds"))
# output classifications file
infercnv_results_file <- file.path(params$results_dir, glue::glue("{params$library_id}_infercnv-classifications.tsv"))
# make sure files exist
stopifnot(
"SCE file is missing" = file.exists(sce_file),
"PNG file with heatmap is missing" = file.exists(png_file),
"CNV metadata file is missing" = file.exists(cnv_metadata_file),
"Final InferCNV object is missing" = file.exists(final_cnv_obj_file),
"Marker gene classifications are missing" = file.exists(params$marker_gene_classification)
)
# read in sce object
sce <- readr::read_rds(sce_file)
# read in cnv metadata
# use read.delim instead of readr because we want the rownames to be read in properly
cnv_metadata_df <- read.delim(cnv_metadata_file) |>
tibble::rownames_to_column("barcodes")
# read in tumor normal classifications
manual_classifications_df <- readr::read_tsv(params$marker_gene_classification)
## Rows: 12749 Columns: 2
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): barcodes, marker_gene_classification
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# check if marker gene annotations are present
if(all(is.na(manual_classifications_df$marker_gene_classification))){
has_marker_gene <- FALSE
message("No annotations were available using only marker gene expression.
Any plots comparing InferCNV to marker gene annotation will be skipped.")
} else {
has_marker_gene <- TRUE
}
# read in cnv object
final_cnv_obj <- readr::read_rds(final_cnv_obj_file)
Below is the heatmap produced by InferCNV showing the
CNV’s for each cell using normal cells as a reference. Each row is a
cell and each column is a gene. The top group contains all cells that
were used as the reference group and the bottom is all other cells. The
color indicates the relative expression intensity of each gene with the
darker colors either indicating a gain of expression or loss of
expression compared to the reference cells.
From the InferCNV
documentation:
The normal cells in the top heatmap define baseline expression for genes in normal cells. This baseline distribution of normal gene expression is subtracted from both the normal cells from which was defined as well as the tumor cells. Afterwards, the normal cell expression heatmap should be largely devoid of signal, with the exception of certain outlier gene expression values in certain cells. Removing this baseline of normal expression signal from the tumor cells should reveal those chromosomal regions that have significantly more or less expression than the normal cells, highlighting likely amplified or deleted whole chromosomes or large chromosomal regions.
The dendogram on the left side is hierarchical clustering of all
non-reference cells. The color annotation next to it indicates
subdivisions of the dendogram by cutting it in k_obs_groups
groups, with the default k_obs_groups = 1.
InferCNV uses an HMM-based model to predict whether or
not a copy number variation (either loss or gain) is present for each
chromosome in each cell. For each predicted CNV region, a posterior
probability is calculated for that CNV for each cell.
From the InferCNV
documentation:
CNV regions with mean posterior probabilities of being normal (no CNV) that are above a maximum threshold are removed as likely false positive predictions.
Before we can make any plots we need to add some metadata that contains CNV information on a per cell level to the cell metadata.
# make df for plotting
coldata_df <- sce |>
scuttle::makePerCellDF(use.dimred = "UMAP") |>
# replace UMAP.1 with UMAP1
dplyr::rename_with(
\(x) stringr::str_replace(x, "^UMAP\\.", "UMAP")
) |>
# add in per cell cnv information
dplyr::left_join(cnv_metadata_df, by = c("barcodes"))
# remove MT and GL chr, comments in docs recommend that these are not useful and can be inaccurate
chr_columns_to_keep <- which(!stringr::str_detect(names(coldata_df), "chrMT|chrGL"))
all_cnv_df <- coldata_df[, chr_columns_to_keep]
The metadata that we added includes the following columns for each chromosome:
has_cnv_chr1has_loss_chr1has_dupli_chr1proportion_cnv_chr1proportion_loss_chr1proportion_dupli_chr1proportion_scaled_cnv_chr1proportion_scaled_loss_chr1proportion_scaled_dupli_chr1The has columns are boolean columns indicating if that
cell has presence of a CNV, loss, or duplication event. If
has_cnv is TRUE if either has_loss or
has_dupli is TRUE.
See the comments in this issue explaining what the contents of the columns are.
One approach to classifying tumor cells is to look for cells that have a lot of CNVs. Let’s start by just summing up the total number of chromosomes with CNVs per cell and plotting that.
# grab any columns that have has_cnv
has_cnv_cols <- names(all_cnv_df)[startsWith(colnames(all_cnv_df), "has_cnv")]
# don't include the normal cells that were provided as reference in scaling
cnv_df <- all_cnv_df |>
# get total number of chromosomes with cnvs per cell and then scale that number
dplyr::mutate(
has_cnv = ifelse(
stringr::str_detect(subcluster, "reference"),
NA_real_,
rowSums(across(has_cnv_cols))
))
## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `has_cnv = ifelse(...)`.
## Caused by warning:
## ! Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(has_cnv_cols)
##
## # Now:
## data %>% select(all_of(has_cnv_cols))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
# plot the total number of CNVs
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = has_cnv)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw() +
scale_color_viridis_c()
Those cells in grey (bottom left) are cells used as a reference. If we were to use total CNVs we would need some sort of cut off for the number. Let’s look at the distribution of total CNVs to see if we can identify a cut off.
ggplot(cnv_df, aes(x = has_cnv)) +
geom_freqpoly(binwidth = 1, center = 0) +
theme_bw() +
labs(x = "Total number of CNVs")
## Warning: Removed 438 rows containing non-finite outside the scale range
## (`stat_bin()`).
We will call any tumor cell as those with a total number of cnvs greater than the mean(number of cnvs).
# get mean value for has_cnv, excluding any NAs
mean_cnv <- mean(cnv_df$has_cnv, na.rm=TRUE)
# add cnv classification
cnv_df <- cnv_df |>
dplyr::mutate(cnv_classification = dplyr::case_when(is.na(has_cnv) ~ "Reference",
has_cnv > mean_cnv ~ "Tumor",
has_cnv <= mean_cnv ~ "Normal"))
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = cnv_classification)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw()
Another option is to use the scaled proportion of CNVs which tells us the proportion of genes on that chromosome that are part of the CNV.
proportion_cols <- names(cnv_df)[startsWith(colnames(cnv_df), "proportion_scaled_cnv")]
# get total number of genes per chromosome to use in weighted mean
chr_weights <- final_cnv_obj@gene_order |>
as.data.frame() |>
dplyr::count(chr) |>
# only keep chr 1-22, drops MT and GL chr
dplyr::filter(chr %in% glue::glue("chr{seq(1,22)}")) |>
dplyr::pull(n)
# get mean proportion of cnvs per cell weighted by chromsome size (number of genes)
cnv_df <- cnv_df |>
dplyr::rowwise() |>
# calculate weighted mean, excluding normal cells
dplyr::mutate(
mean_proportion = ifelse(
!is.na(has_cnv),
weighted.mean(across(proportion_cols), chr_weights),
NA)) |>
dplyr::ungroup()
## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `mean_proportion = ifelse(...)`.
## ℹ In row 2.
## Caused by warning:
## ! Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(proportion_cols)
##
## # Now:
## data %>% select(all_of(proportion_cols))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = mean_proportion)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw() +
scale_color_viridis_c()
ggplot(cnv_df, aes(x = mean_proportion)) +
geom_density() +
theme_bw()
## Warning: Removed 438 rows containing non-finite outside the scale range
## (`stat_density()`).
Let’s see how this matches up with the tumor/normal classifications we defined just using total CNVs.
In the below plot we label all non-reference cells as either tumor or normal. Cells that are NA are those that were already classified as reference cells.
ggplot(cnv_df, aes(x = mean_proportion, color = cnv_classification)) +
geom_density() +
theme_bw()
## Warning: Removed 438 rows containing non-finite outside the scale range
## (`stat_density()`).
Below we will call tumor cells based on those that have a mean scaled proportion higher than the mean of all proportions.
# get mean value for mean_proportion, excluding any NAs
all_mean_proportion <- mean(cnv_df$mean_proportion, na.rm = TRUE)
# add cnv classification
cnv_df <- cnv_df |>
dplyr::mutate(cnv_mean_classification = dplyr::case_when(is.na(mean_proportion) ~ "Reference",
mean_proportion > all_mean_proportion ~ "Tumor",
mean_proportion <= all_mean_proportion ~ "Normal"))
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = cnv_mean_classification)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw()
For the rest of the notebook, we will use the classifications of tumor cells based on the mean proportion of total CNV occurrences.
Below we can compare these annotations from InferCNV to classifying tumor cells manually with marker genes.
all_classifications_df <- cnv_df |>
dplyr::left_join(manual_classifications_df) |>
# relevel classifications for confusion matrix
dplyr::filter(cnv_mean_classification != "Reference") |>
dplyr::mutate(
cnv_classification = forcats::fct_relevel(cnv_mean_classification, "Tumor"),
marker_gene_classification = forcats::fct_relevel(marker_gene_classification, "Tumor")
)
## Joining with `by = join_by(barcodes)`
caret::confusionMatrix(
table(
all_classifications_df$cnv_classification,
all_classifications_df$marker_gene_classification
)
)
## Confusion Matrix and Statistics
##
##
## Tumor Normal
## Tumor 2602 3044
## Normal 3307 3358
##
## Accuracy : 0.4841
## 95% CI : (0.4753, 0.493)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : -0.0352
##
## Mcnemar's Test P-Value : 0.00101
##
## Sensitivity : 0.4403
## Specificity : 0.5245
## Pos Pred Value : 0.4609
## Neg Pred Value : 0.5038
## Prevalence : 0.4800
## Detection Rate : 0.2114
## Detection Prevalence : 0.4586
## Balanced Accuracy : 0.4824
##
## 'Positive' Class : Tumor
##
We can also compare to the existing annotations obtained using
SingleR and CellAssign as part of
scpca-nf.
celltype_columns <- c(
"singler_celltype_annotation",
"cellassign_celltype_annotation"
)
# create jaccard matrices for SingleR and CellAssign compared to aneuploid/diploid
jaccard_matrices <- celltype_columns |>
purrr::map(\(name) {
make_jaccard_matrix(
cnv_df,
"cnv_mean_classification",
name
)
}) |>
purrr::set_names("SingleR", "CellAssign")
# Set heatmap padding option
heatmap_padding <- 0.2
ComplexHeatmap::ht_opt(TITLE_PADDING = grid::unit(heatmap_padding, "in"))
# list of heatmaps looking at SingleR/ CellAssign vs tumor/normal
heatmap <- jaccard_matrices |>
purrr::imap(
\(celltype_mat, celltype_method) {
ComplexHeatmap::Heatmap(
t(celltype_mat), # transpose because matrix rows are in common & we want a vertical arrangement
col = circlize::colorRamp2(c(0, 1), colors = c("white", "darkslateblue")),
border = TRUE,
## Row parameters
cluster_rows = TRUE,
row_title = celltype_method,
row_title_gp = grid::gpar(fontsize = 12),
row_title_side = "left",
row_names_side = "left",
row_dend_side = "right",
row_names_gp = grid::gpar(fontsize = 10),
## Column parameters
cluster_columns = FALSE,
column_title = "",
column_title_gp = grid::gpar(fontsize = 12),
column_names_side = "bottom",
column_names_gp = grid::gpar(fontsize = 10),
column_names_rot = 90,
## Legend parameters
heatmap_legend_param = list(
title = "Jaccard index",
direction = "vertical",
legend_width = unit(1.5, "in")
),
show_heatmap_legend = celltype_method == "SingleR",
)
}) |>
# concatenate vertically into HeatmapList object
purrr::reduce(ComplexHeatmap::`%v%`) |>
ComplexHeatmap::draw(
heatmap_legend_side = "right",
# add a margin to the heatmap so labels don't get cut off
padding = unit(c(2, 20, 2, 2), "mm")
)
There are a few known copy number variations in Ewing’s sarcoma:
Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations. See Tirode et al., and Crompton et al..
First we will compare the proportions across each chromosome between cells we have labeled as tumor or normal. Then we will look specifically at the chromosomes commonly affected in Ewing sarcoma.
# prepare a data frame with one row per chromosome per cell for plotting
proportion_df <- cnv_df |>
# select only relevant columns for plotting
dplyr::select(barcodes, UMAP1, UMAP2, cnv_mean_classification, all_of(proportion_cols)) |>
tidyr::pivot_longer(cols = all_of(proportion_cols),
names_to = "chromosome",
values_to = "proportion") |>
# remove extra string in front of chr
dplyr::mutate(
chromosome = stringr::str_remove(chromosome, "proportion_scaled_cnv_chr"),
# turn into numeric for easy ordering in plot
chromosome = as.numeric(chromosome)
)
ggplot(proportion_df, aes(x = proportion, color = cnv_mean_classification)) +
geom_density(bounds = c(0,1), bw = 0.05) +
theme_bw() +
facet_wrap(vars(chromosome), ncol = 3)
## Warning: Some data points are outside of `bounds`. Removing them.
## Some data points are outside of `bounds`. Removing them.
## Some data points are outside of `bounds`. Removing them.
## Some data points are outside of `bounds`. Removing them.
Below we plot only Chr1, Chr8, Chr12, and Chr16 and look at the proportion on a UMAP.
ewings_chr_df <- proportion_df |>
dplyr::filter(chromosome %in% c(1,8,12,16))
ggplot(ewings_chr_df, aes(x = UMAP1, y=UMAP2 , color = proportion)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_bw() +
facet_wrap(vars(chromosome)) +
scale_color_viridis_c()
We will save the tumor/normal classifications using both the sum and proportion.
# get a dataframe with barcodes, sum_classification and cluster_classification
celltypes_df <- cnv_df |>
dplyr::select(
barcodes,
cnv_proportion_classification = cnv_mean_classification,
cnv_sum_classification = cnv_classification
)
readr::write_tsv(celltypes_df, infercnv_results_file)
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_3.5.1
## [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [5] Biobase_2.64.0 GenomicRanges_1.56.0
## [7] GenomeInfoDb_1.40.0 IRanges_2.38.0
## [9] S4Vectors_0.42.0 BiocGenerics_0.50.0
## [11] MatrixGenerics_1.16.0 matrixStats_1.3.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.18.5 rlang_1.1.3
## [3] magrittr_2.0.3 clue_0.3-65
## [5] GetoptLong_1.0.5 e1071_1.7-14
## [7] compiler_4.4.0 DelayedMatrixStats_1.26.0
## [9] png_0.1-8 reshape2_1.4.4
## [11] vctrs_0.6.5 stringr_1.5.1
## [13] shape_1.4.6.1 pkgconfig_2.0.3
## [15] crayon_1.5.2 fastmap_1.2.0
## [17] XVector_0.44.0 scuttle_1.14.0
## [19] labeling_0.4.3 utf8_1.2.4
## [21] rmarkdown_2.26 prodlim_2023.08.28
## [23] tzdb_0.4.0 UCSC.utils_1.0.0
## [25] purrr_1.0.2 bit_4.0.5
## [27] xfun_0.44 zlibbioc_1.50.0
## [29] cachem_1.0.8 beachmat_2.20.0
## [31] jsonlite_1.8.8 recipes_1.0.10
## [33] highr_0.10 DelayedArray_0.30.1
## [35] BiocParallel_1.38.0 cluster_2.1.6
## [37] parallel_4.4.0 R6_2.5.1
## [39] RColorBrewer_1.1-3 bslib_0.7.0
## [41] stringi_1.8.4 parallelly_1.37.1
## [43] rpart_4.1.23 lubridate_1.9.3
## [45] jquerylib_0.1.4 Rcpp_1.0.12
## [47] iterators_1.0.14 knitr_1.46
## [49] future.apply_1.11.2 readr_2.1.5
## [51] timechange_0.3.0 Matrix_1.7-0
## [53] splines_4.4.0 nnet_7.3-19
## [55] tidyselect_1.2.1 abind_1.4-5
## [57] yaml_2.3.8 timeDate_4032.109
## [59] doParallel_1.0.17 codetools_0.2-20
## [61] listenv_0.9.1 lattice_0.22-6
## [63] tibble_3.2.1 plyr_1.8.9
## [65] withr_3.0.0 evaluate_0.23
## [67] future_1.33.2 survival_3.5-8
## [69] proxy_0.4-27 circlize_0.4.16
## [71] pillar_1.9.0 BiocManager_1.30.23
## [73] renv_1.0.7 foreach_1.5.2
## [75] generics_0.1.3 vroom_1.6.5
## [77] rprojroot_2.0.4 hms_1.1.3
## [79] sparseMatrixStats_1.16.0 munsell_0.5.1
## [81] scales_1.3.0 globals_0.16.3
## [83] class_7.3-22 glue_1.7.0
## [85] tools_4.4.0 data.table_1.15.4
## [87] ModelMetrics_1.2.2.2 gower_1.0.1
## [89] forcats_1.0.0 grid_4.4.0
## [91] tidyr_1.3.1 ipred_0.9-14
## [93] colorspace_2.1-0 nlme_3.1-164
## [95] GenomeInfoDbData_1.2.12 cli_3.6.2
## [97] fansi_1.0.6 S4Arrays_1.4.0
## [99] viridisLite_0.4.2 ComplexHeatmap_2.20.0
## [101] lava_1.8.0 gtable_0.3.5
## [103] sass_0.4.9 digest_0.6.35
## [105] caret_6.0-94 SparseArray_1.4.3
## [107] rjson_0.2.21 farver_2.1.2
## [109] htmltools_0.5.8.1 lifecycle_1.0.4
## [111] hardhat_1.3.1 httr_1.4.7
## [113] GlobalOptions_0.1.2 bit64_4.0.5
## [115] MASS_7.3-60.2